264

17

Genomics

viruses, 20 which generally have very compact genomes. However, the reading frames

of eukaryotes are generally nonoverlapping (i.e., only the triplets AAG, TTC, …

would be available).

Due to the absence of unambiguous separators, the available structural information

in DNA is much more basic than in a human language. Even if the “meaning” of a

DNA sequence (a gene) that corresponds to a functional protein might be more

or less clear, especially in the case of prokaryotes, it must be remembered that

the sequence may be shot through with introns; even the stop codons (Table 7.1)

are not unambiguous. Only a small fraction (a few per cent) of eukaryotic genome

sequences actually corresponds to proteins (cf. Table 14.2), and any serious attempt to

understand the semantics of the genome must encompass the totality of its sequence.

Nucleotide Frequencies

Due to the lack of separators, it is necessary to work with $n$-grams rather than words

as such. Basic information about the sequence is encapsulated in the frequency

dictionaries $W_n$ of the $n$-grams (i.e., lists of the numbers of occurrences of each

possible $n$-gram). Each sequence can then be plotted as a point in $M^n$-dimensional

space, where $M$ is the number of letters in the alphabet ($= 4$ for DNA, or 5 if we

include methylated cytosine as a distinct base).
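
A frequency dictionary of this kind can be sketched in a few lines of Python. The function name `frequency_dictionary`, its parameters, and the toy sequence are illustrative assumptions rather than anything defined in the text; the sketch counts overlapping n-grams and lists all possible n-grams of the alphabet explicitly, so that every sequence maps to a point with the same coordinates.

```python
from collections import Counter
from itertools import product

def frequency_dictionary(seq, n, alphabet="ACGT"):
    """Count every overlapping n-gram of `seq` (hypothetical helper).

    All len(alphabet)**n possible n-grams appear as keys, with a count of 0
    if absent; n-grams containing letters outside `alphabet` are ignored.
    """
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    keys = ["".join(p) for p in product(alphabet, repeat=n)]
    return {k: counts.get(k, 0) for k in keys}

# Toy sequence, invented for illustration only
W2 = frequency_dictionary("ACGTACGT", 2)
print(len(W2))   # number of coordinates for n = 2 and a 4-letter alphabet
print(W2["AC"], W2["TA"])
```

Passing `alphabet="ACGTM"` (with `M` as an assumed symbol for methylated cytosine) would give the five-letter variant mentioned above.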

Even such very basic information can be used to distinguish between different

genomes; for example, thermophilic organisms are generally richer in C and G,

because the C–G base-pairing is stronger and, hence, stabler at higher temperatures

than A–T. Furthermore, since each genome corresponds to a point in a particular

space, distances between them can be determined, and phylogenetic trees can be

assembled.
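
As a minimal sketch of this idea, the following Python computes the G+C content of a sequence and the Euclidean distance between two normalized frequency vectors. The choice of the Euclidean metric, the helper names, and the toy sequences are assumptions made for illustration; the text does not specify a particular distance, and the resulting distance matrix would still have to be passed to a separate tree-building method.

```python
import math
from collections import Counter
from itertools import product

def frequency_vector(seq, n, alphabet="ACGT"):
    """Relative frequencies of all len(alphabet)**n n-grams, in a fixed order.

    Assumes `seq` contains at least one n-gram over `alphabet`.
    """
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    keys = ["".join(p) for p in product(alphabet, repeat=n)]
    total = sum(counts[k] for k in keys)
    return [counts[k] / total for k in keys]

def euclidean_distance(u, v):
    """Euclidean distance between two points of the frequency space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def gc_content(seq):
    """Fraction of the sequence made up of G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

# Toy sequences, invented for illustration only (not real genomes)
toy = {"a": "ATATATTAAT", "b": "ATATTTAAAT", "c": "GCGCGGCCGC"}
vectors = {name: frequency_vector(s, 1) for name, s in toy.items()}
d_ab = euclidean_distance(vectors["a"], vectors["b"])
d_ac = euclidean_distance(vectors["a"], vectors["c"])
print(d_ab, d_ac, gc_content(toy["c"]))
```

Here the two A+T-rich toy sequences coincide in the single-base space, whereas the G+C-rich one lies far from both.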

The four-dimensional space corresponding to the single base-pair frequencies is

not perhaps very interesting. Already the 16-dimensional space corresponding to the

dinucleotide frequencies is richer and might be expected to be more revealing. In

particular, given the single base-pair frequencies, one can compute the dinucleotide

frequencies expected from random assembly of the genome and determine diver-

gences from randomness. Dinucleotide bias is assessed, for example, by the odds

ratio $\rho_{\mathrm{XY}} = w_{\mathrm{XY}}/(w_{\mathrm{X}} w_{\mathrm{Y}})$, where $w_{\mathrm{X}}$ is the frequency of nucleotide X. 21 We will

return to this comparison of actual with expected frequencies below.
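
The odds ratio can be computed directly from the single-base and dinucleotide counts. The sketch below is a single-strand version under stated assumptions: the function name and the toy sequence are hypothetical, and some published analyses additionally combine a sequence with its reverse complement before counting, which is not done here. A ratio of 1 corresponds to the frequency expected from random assembly; values above or below 1 indicate over- or under-representation.

```python
from collections import Counter

def dinucleotide_odds_ratios(seq, alphabet="ACGT"):
    """Return rho_XY = w_XY / (w_X * w_Y) for every dinucleotide XY.

    Dinucleotides whose constituent bases do not occur in `seq` are omitted,
    because the expected frequency in the denominator would be zero.
    """
    n = len(seq)
    mono = Counter(seq)                                 # single-base counts
    di = Counter(seq[i:i + 2] for i in range(n - 1))    # overlapping pairs
    rho = {}
    for x in alphabet:
        for y in alphabet:
            w_x = mono[x] / n
            w_y = mono[y] / n
            w_xy = di[x + y] / (n - 1)
            if w_x > 0 and w_y > 0:
                rho[x + y] = w_xy / (w_x * w_y)
    return rho

# Toy sequence, invented for illustration only
rho = dinucleotide_odds_ratios("ACGTACGTAC")
for pair in ("AC", "AA"):
    print(pair, round(rho[pair], 3))
```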

Instead of representing the entire genome by a single point, one can divide it up

into roughly gene-long fragments (100–1000 base pairs), determine their frequency

dictionaries, and apply some kind of clustering algorithm to the collection of points

thereby generated. Alternatively, dimensional reduction using principal component

analysis (Sect. 13.2.2) may be adequate. The distributions of the single base-pair and

dinucleotide frequencies look like Gaussian clouds, but the triplet frequencies reveal

a remarkable seven-cluster structure. 22 It is natural to interpret the seven clusters as

the six possible reading frames (three in each direction) plus the “noncoding” DNA.
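
The first step of such an analysis, turning a genome into a cloud of points, might be sketched as follows. Several things here are assumptions not stated in the text: the fragment width and step, the helper name, and in particular the choice to count non-overlapping triplets read from the first base of each fragment (so that the position of a fragment relative to a gene's reading frame can affect the result). Clustering or principal component analysis would then be applied to the returned points with a separate tool.

```python
from collections import Counter
from itertools import product

# All 64 possible triplets, in a fixed order that defines the coordinates
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]

def fragment_triplet_vectors(seq, width=300, step=300):
    """Return one 64-dimensional triplet-frequency vector per fragment."""
    points = []
    for start in range(0, len(seq) - width + 1, step):
        frag = seq[start:start + width]
        # Non-overlapping triplets, read from the fragment's first base
        counts = Counter(frag[i:i + 3] for i in range(0, width - 2, 3))
        total = sum(counts[t] for t in TRIPLETS)
        if total:  # skip fragments containing no unambiguous triplets
            points.append([counts[t] / total for t in TRIPLETS])
    return points

# Toy repetitive sequence, invented for illustration only
points = fragment_triplet_vectors("ATGGCC" * 100, width=300, step=150)
print(len(points), len(points[0]))
```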

20 For example, Zaaijer et al. (2007).

21 See, e.g., Karlin et al. (1994).

22 Gorban et al. (2005).